1 INTRODUCTION



Because of the ethical and economic considerations in the design of clinical trials to test the
efficacy of new treatments, and because of the lack of information on the magnitude and sampling
variability of the treatment effect at the design stage, there has been increasing interest
from the biopharmaceutical industry in sequential methods that can adapt to information
acquired during the course of the trial. Beginning with Bauer (1989), who introduced sequential
adaptive test strategies over a planned series of separate trials, and Wittes and Brittain
(1990), who discussed internal pilot studies, a large literature has grown on adaptive design
of clinical trials. Depending on the topics covered, the term "adaptive design" in this
literature is sometimes replaced by "sample size re-estimation," "trial extension" or
"internal pilot studies." In standard clinical trial designs, the sample size is determined
by the power at a given alternative, but in practice, it is often difficult for investigators
to specify a realistic alternative on which sample size determination can be based.
Although a standard method to address this difficulty is to carry out a preliminary pilot
study, the results from a small pilot study may be difficult to interpret and apply,
as pointed out by Wittes and Brittain (1990), who proposed to treat the first stage of a
two-stage clinical trial as an internal pilot from which the overall sample size can be
re-estimated. The problem of sample size re-estimation based on the observed treatment difference
at some time before the prescheduled end of a clinical trial has attracted considerable
attention during the past decade; see, e.g., Gould and Shih (1992), Herson and Wittes (1993),
Birkett and Day (1994), Denne and Jennison (1999, 2000), Jennison and Turnbull (2000,
Section 14.2), Shih (2001) and Whitehead et al. (2001). For normally distributed outcome
variables with known variances, Proschan and Hunsberger (1995), Fisher (1998), Posch and Bauer
(1999) and Shen and Fisher (1999) have proposed ways to adjust the test statistics after
mid-course sample size modification so that the Type I error probability is maintained at the
prescribed level. By making use of a generalization of the Neyman-Pearson lemma, Tsiatis and
Mehta (2003) showed that these adaptive tests of a simple null versus a simple alternative
hypothesis are inefficient because they are not based on likelihood ratio statistics.
Jennison and Turnbull (2003) gave a general weighted form of these test statistics and
demonstrated in simulation studies that the adaptive tests performed considerably worse than
group sequential tests. Jennison and Turnbull (2006a) recently introduced adaptive group
sequential tests that choose the $j$th group size and stopping boundary on the basis of the
cumulative sample size $n_{j-1}$ and the sample sum $S_{n_{j-1}}$ over the first $j-1$ groups,
and that are optimal in the sense of minimizing a weighted average of the expected sample sizes
over a collection of parameter values subject to prescribed error probabilities at the null and
a given alternative hypothesis. They showed how the corresponding optimization problem
can be solved numerically by using the backward induction algorithms for "optimal sequentially
planned" designs developed by Schmitz (1993). Jennison and Turnbull (2006b) found
that standard (non-adaptive) group sequential tests with the first stage chosen optimally are
nearly as efficient as their optimal adaptive tests.

Except for Jennison and Turnbull's optimal adaptive group sequential tests and the
extensions of the sample size re-estimation approach to group sequential testing considered
by Cui et al. (1999), Lehmacher and Wassmer (1999) and Denne and Jennison (2000), previous
works in the literature on mid-course sample size re-estimation have focused on two-stage
designs whose second-stage sample size is determined by the results from the first stage
(internal pilot), following the seminal work of Stein (1945) in this area. Although this approach
is intuitively appealing, it does not adjust for the uncertainty in the first-stage parameter
estimates that are used to determine the second-stage sample size. Moreover, it considers
primarily the special problem of comparing the means of two normal populations, using
the central limit theorem for extensions to more general situations. The case of unknown
common variance at a prespecified alternative for the mean difference was considered first,
as in Stein (1945) and the first set of references in the preceding paragraph. Then the case
of known variances in the absence of a prespecified alternative for the mean difference was
studied, as in the second set of references above.



Bartroff and Lai (2008) recently gave a unified treatment of both cases in the general
framework of multiparameter exponential families. It uses efficient generalized likelihood
ratio (GLR) statistics in this framework and adds a third stage to adjust for the sampling
variability of the first-stage parameter estimates that determine the second-stage sample
size. Specifically, let $X_1, X_2, \ldots$ be independent $d$-dimensional random vectors from a
multiparameter exponential family $f_\theta(x) = \exp\{\theta'x - \psi(\theta)\}$ of densities with
respect to some measure $\nu$ on $\mathbb{R}^d$. Let $S_n = X_1 + \cdots + X_n$. A sufficient statistic
based on $(X_1, \ldots, X_n)$ is the sample mean $\bar X_n = S_n/n$, which is the maximum likelihood
estimate of the mean $\nabla\psi(\theta)$. Consider the hypothesis $u(\theta) = u_j$, where $j = 0$ or 1,
$u: \Theta \to \mathbb{R}$ is continuously differentiable and
$\Theta = \{\theta : \int e^{\theta'x}\,d\nu(x) < \infty\}$ is the natural parameter space. As noted by
Lai and Shih (2004, p. 513), the GLR statistic $\Lambda_{i,j}$ for total sample size $n_i$ at stage $i$
has the form

    $$\Lambda_{i,j} = n_i\{\hat\theta_{n_i}'\bar X_{n_i} - \psi(\hat\theta_{n_i})\} - \sup_{\theta: u(\theta)=u_j} n_i\{\theta'\bar X_{n_i} - \psi(\theta)\} = \inf_{\theta: u(\theta)=u_j} n_i I(\hat\theta_{n_i}, \theta), \qquad (1.1)$$

in which $\hat\theta_n = (\nabla\psi)^{-1}(\bar X_n)$ and $I(\theta, \lambda)$ is the Kullback-Leibler information
number given by

    $$I(\theta, \lambda) = E_\theta[\log\{f_\theta(X_i)/f_\lambda(X_i)\}] = (\theta - \lambda)'\nabla\psi(\theta) - \{\psi(\theta) - \psi(\lambda)\}. \qquad (1.2)$$

The possibility of adding a third stage to improve two-stage designs dates back to Lorden
(1983). Whereas Lorden used crude upper bounds for the Type I error probability that are
too conservative for practical applications, Bartroff and Lai (2008) overcame this difficulty
by developing numerical methods to compute the Type I error probability, and also extended
the three-stage test to multiparameter and multi-armed settings, thus greatly broadening the
scope of these efficient adaptive designs. A review of their method is given in Section 2.1. In
Section 4 we prove the asymptotic optimality of these adaptive tests in the multiparameter
case, extending Lorden's (1983) result for the special case $d = 1$.
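
To make (1.1) and (1.2) concrete, the following minimal sketch evaluates the Kullback-Leibler information from the generic exponential-family formula for two one-parameter families and uses it to form the GLR statistic when $u(\theta) = \theta$ is scalar, so that the infimum in (1.1) is attained at $\theta_j$. The function names are illustrative assumptions, not notation from the paper.

```python
import math

def kl_normal(theta, lam):
    """I(theta, lam) for N(theta, 1), where psi(t) = t^2 / 2 and grad psi(t) = t."""
    psi = lambda t: 0.5 * t * t
    return (theta - lam) * theta - (psi(theta) - psi(lam))

def kl_bernoulli(p, q):
    """I for Bernoulli means p, q in (0, 1), computed through the natural
    parameter theta = logit(p), with psi(t) = log(1 + e^t) and grad psi = p."""
    theta = math.log(p / (1 - p))
    lam = math.log(q / (1 - q))
    psi = lambda t: math.log1p(math.exp(t))
    return (theta - lam) * p - (psi(theta) - psi(lam))

def glr_scalar(n, estimate, null_value, kl):
    """Lambda = n * I(estimate, null_value); valid when u(theta) = theta is scalar,
    so the infimum in (1.1) reduces to a single point (an assumption of this sketch)."""
    return n * kl(estimate, null_value)
```

For the normal family the generic formula collapses to $(\theta - \lambda)^2/2$, and for the Bernoulli family to $p\log(p/q) + (1-p)\log\{(1-p)/(1-q)\}$, which are the forms used later in Sections 2.3 and 3.1.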

Another new addition to this asymptotic optimality theory of adaptive designs is related
to the problem of trial extension considered in Section 2.2. As pointed out by Cui et al.
(1999), the issue of increasing the maximum sample size sometimes arises after interim
analysis in group sequential trials. They cited a study protocol, which was reviewed by the
Food and Drug Administration, involving a Phase III group sequential trial for evaluating
the efficacy of a new drug to prevent myocardial infarction in patients undergoing coronary
artery bypass graft surgery. During interim analysis, the observed incidence for the drug
achieved a reduction that was only half of the target reduction assumed in the calculation of
the maximum sample size $M$, resulting in a proposal to increase the maximum sample size to
$\tilde M$ ($N_{\max}$ in their notation). Cui et al. (1999) and Lehmacher and Wassmer (1999) extended
the sample size re-estimation approach to adaptive group sequential trials by adjusting the
test statistics as in Proschan and Hunsberger (1995) and allowing the future group sizes to
be increased or decreased during interim analyses so that the overall sample size does not
exceed $\tilde M$ ($> M$) and the Type I error probability is maintained at the prescribed level. In
Section 2.2 we propose an alternative approach that is shown to be asymptotically efficient
in Section 4. Whereas the adaptive designs in Section 2.1 assume a given maximum sample
size $M$ and require at most 3 stages, Section 2.2 extends them to allow mid-course increase
of $M$ to $\tilde M$. These adaptive designs involve no more than 4 stages, and an adjustment in the
maximum sample size is made at the third stage. Computational algorithms to implement
them are provided in Section 2.3 and simulation results on their performance are given in
Section 3.4.



Bartroff and Lai (2008) carried out comprehensive simulation studies of the performance,
measured in terms of the expected sample size and power functions, of their adaptive test
and compared it with other adaptive tests in the literature. In the case of a normal mean
with known variance and Type I and II error constraints under the null and a given
alternative hypothesis, they showed that their adaptive test is comparable to the benchmark
optimal adaptive test of Jennison and Turnbull (2006a,b), which is superior to the existing
two-stage adaptive designs. On the other hand, whereas the benchmark optimal adaptive
test needs to assume a specified alternative, these adaptive two-stage tests and the
adaptive tests of Bartroff and Lai (2008) do not require such assumptions, as they consider the
estimated alternative at the end of the first stage. In their recent survey of adaptive designs,
Burman and Sonesson (2006) pointed out that previous criticisms of the statistical
properties of two-stage adaptive designs may be unconvincing in some situations when
flexibility and not having to specify parameters that are unknown at the beginning of a trial
(like the relevant treatment effect or variance) are more imperative than efficiency or
being powerful. The adaptive designs in Bartroff and Lai (2008) and this paper can fulfill
the seemingly disparate requirements of flexibility and efficiency on a design. Rather than
achieving exact optimality at a specified collection of alternatives through dynamic
programming, they achieve asymptotic optimality over the entire range of alternatives, resulting in
near-optimality in practice. They are based on efficient test statistics of the GLR type,
which have an intuitively "adaptive" appeal via estimation of unknown parameters by
maximum likelihood, ease of implementation and freedom from having to specify the relevant
alternative; see Section 2.



Bauer and Einfalt (2006) have found from a search of the medical literature that adaptive
designs have not been widely used in practice and that "adaptations in practice are rather
limited to sample size reassessment." Perhaps one reason why these two-stage adaptive
designs have not gained wide acceptance is their use of seemingly unnatural and convoluted
test statistics (e.g., the inefficient test statistics mentioned in the first paragraph). This can
be circumvented by the use of efficient GLR statistics in our adaptive tests with no more
than 3 (or 4) stages. Another reason may be the lack of routine medical studies in which
adaptive designs can lead to substantial improvements over current practice. In Section 3.1
we consider one such potential application in Phase II cancer studies. We show how our
adaptive designs offer improvements over Simon's (1989) optimal two-stage designs, which
are commonly used in single-arm cancer trials, and over their analogs, due to Thall et al.
(1988), for randomized trials in Section 3.2. Section 3.3 considers another issue that often
arises in the design of clinical trials, namely, nuisance parameters. "Very often,
statistical information also depends on nuisance parameters (e.g., the standard deviation of the
response variable). Extension of statistical information is a design adaptation that occurs
most frequently" (Hung et al., 2006). The simulation results in Section 3.3 show how our
adaptive test resolves the difficulties that conventional two-stage designs have in treating
nuisance parameters, which are estimated at the end of the first stage and then used to estimate
the second-stage sample size. Further discussion of our proposed approach to adaptive designs
and some concluding remarks are given in the final section.



2 EFFICIENT ADAPTIVE DESIGN AND GLR TESTS 



2.1 An Adaptive 3-Stage GLR Test 



Whereas Tsiatis and Mehta (2003) consider the case of simple null and alternative hypotheses
$\theta = \theta_j$ ($j = 0, 1$), for which likelihood ratio tests are most powerful even in their group
sequential designs, Bartroff and Lai (2008) use the GLR statistics (1.1) in an adaptive
three-stage test of the composite null hypothesis $H_0: u(\theta) \le u_0$, where $u$ is a smooth
real-valued function such that

    $I(\theta, \lambda)$ is increasing in $u(\lambda)$ for every fixed $\theta$.    (2.1)

Let $n_1 = m$ be the sample size of the first stage (or internal pilot study) and $n_3 = M$ be the
maximum total sample size, both specified before the trial. Let $u_1 > u_0$ be the alternative
implied by the maximum sample size $M$ and the reference Type II error probability $\tilde\alpha$. That
is, $u_1$ ($> u_0$) is the alternative where the fixed sample size (FSS) GLR test with Type I error
probability $\alpha$ and sample size $M$ has power $\inf_{\theta: u(\theta)=u_1} P_\theta\{\text{Reject } H_0\}$ equal to
$1 - \tilde\alpha$, as in Lai and Shih (2004, Section 3.4). The three-stage test of $H_0: u(\theta) \le u_0$ stops
and rejects $H_0$ at stage $i \le 2$ if

    $$n_i < M, \quad u(\hat\theta_{n_i}) > u_0 \quad \text{and} \quad \Lambda_{i,0} \ge b. \qquad (2.2)$$

Early stopping for futility (accepting $H_0$) can also occur at stage $i \le 2$ if

    $$n_i < M, \quad u(\hat\theta_{n_i}) < u_1 \quad \text{and} \quad \Lambda_{i,1} \ge \tilde b. \qquad (2.3)$$

The test rejects $H_0$ at stage $i = 2$ or 3 if

    $$n_i = M, \quad u(\hat\theta_M) > u_0 \quad \text{and} \quad \Lambda_{i,0} \ge c, \qquad (2.4)$$

accepting $H_0$ otherwise. The sample size $n_2$ of the three-stage test is given by

    $$n_2 = m \vee \{M \wedge \lceil (1 + \rho_m)\, n(\hat\theta_m) \rceil\} \qquad (2.5)$$

with

    $$n(\theta) = \min\Big\{ |\log\alpha| \Big/ \inf_{\lambda: u(\lambda)=u_0} I(\theta, \lambda),\ \ |\log\tilde\alpha| \Big/ \inf_{\lambda: u(\lambda)=u_1} I(\theta, \lambda) \Big\}, \qquad (2.6)$$

where $\rho_m > 0$ is an inflation factor to adjust for uncertainty in $\hat\theta_m$; see the examples in
Section 3. Letting $0 < \epsilon, \tilde\epsilon < 1$, define the thresholds $b$, $\tilde b$ and $c$ to satisfy the equations

    $$\sup_{\theta: u(\theta)=u_1} P_\theta\{\text{(2.3) occurs for } i = 1 \text{ or } 2\} = \tilde\epsilon\tilde\alpha, \qquad (2.7)$$
    $$\sup_{\theta: u(\theta)=u_0} P_\theta\{\text{(2.3) does not occur for } i \le 2,\ \text{(2.2) occurs for } i = 1 \text{ or } 2\} = \epsilon\alpha, \qquad (2.8)$$
    $$\sup_{\theta: u(\theta)=u_0} P_\theta\{\text{(2.2) and (2.3) do not occur for } i \le 2,\ \text{(2.4) occurs}\} = (1 - \epsilon)\alpha. \qquad (2.9)$$

The probabilities in (2.7)-(2.9) can be computed by using the normal approximation to the
signed-root likelihood ratio statistic

    $$\ell_{i,j} = \{\mathrm{sign}(u(\hat\theta_{n_i}) - u_j)\}(2 n_i \Lambda_{i,j})^{1/2}$$

($1 \le i \le 3$ and $j = 0, 1$) under $u(\theta) = u_j$, as in Lai and Shih (2004, p. 513). When
$u(\theta) = u_j$, $\ell_{i,j}$ is approximately normal with mean 0 and variance $n_i$, and the increments
$\ell_{i,j} - \ell_{i-1,j}$ are asymptotically independent. We can therefore approximate $\ell_{i,j}$ by a sum of
independent standard normal random variables under $u(\theta) = u_j$ and thereby determine $b$, $\tilde b$
and $c$. Note that this normal approximation can also be used for the choice of $u_1$ implied by
$M$ and $\tilde\alpha$. Computational details are given in Section 2.3, as well as an alternate method for
computing the thresholds by Monte Carlo.
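
The stopping rule (2.2)-(2.4) and the second-stage sample size (2.5)-(2.6) can be sketched for the simplest case. The sketch below assumes $X_i \sim N(\theta, 1)$, $u(\theta) = \theta$ and $\theta_0 = 0$, so that $I(\theta, \lambda) = (\theta - \lambda)^2/2$; the function names and the threshold values in the usage note are illustrative assumptions rather than values from the paper.

```python
import math

def n_theta(theta, theta1, alpha, alpha_t):
    """(2.6) for N(theta, 1) with theta_0 = 0 (assumed), I(theta, lam) = (theta - lam)^2 / 2."""
    i0 = theta ** 2 / 2.0
    i1 = (theta - theta1) ** 2 / 2.0
    to_null = abs(math.log(alpha)) / i0 if i0 > 0 else math.inf
    to_alt = abs(math.log(alpha_t)) / i1 if i1 > 0 else math.inf
    return min(to_null, to_alt)

def second_stage_size(xbar_m, m, M, theta1, alpha, alpha_t, rho):
    """(2.5): n_2 = m v {M ^ ceil((1 + rho) n(theta_hat_m))}, with theta_hat_m = xbar_m."""
    inflated = (1.0 + rho) * n_theta(xbar_m, theta1, alpha, alpha_t)
    return max(m, math.ceil(min(M, inflated)))

def decide(n, s_n, M, theta1, b, b_t, c):
    """Apply (2.2)-(2.4) after n observations with sum s_n.
    Returns 'reject', 'accept' or 'continue' (decision about H_0)."""
    xbar = s_n / n
    lam0 = n * xbar ** 2 / 2.0                 # Lambda_{i,0}
    lam1 = n * (xbar - theta1) ** 2 / 2.0      # Lambda_{i,1}
    if n < M:
        if xbar > 0 and lam0 >= b:
            return "reject"                    # efficacy stop, (2.2)
        if xbar < theta1 and lam1 >= b_t:
            return "accept"                    # futility stop, (2.3)
        return "continue"
    # final analysis at n = M, (2.4)
    return "reject" if (xbar > 0 and lam0 >= c) else "accept"
```

For instance, `second_stage_size(0.5, 25, 100, 0.3, 0.05, 0.2, 0.1)` evaluates the rule with a first-stage mean of 0.5 and a 10% inflation factor; a larger observed effect shortens the second stage, while an estimate near the null or the alternative lengthens it up to the cap $M$.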

A special multiparameter case of particular interest in clinical trials involves $K$
independent populations having density functions $\exp\{\theta_k x - \psi_k(\theta_k)\}$, so that
$\theta'x - \psi(\theta) = \sum_{k=1}^K \{\theta_k x_k - \psi_k(\theta_k)\}$. In multi-armed trials, for which different
numbers of patients are assigned to different treatments, the GLR statistic $\Lambda_{i,j}$ for testing
the hypothesis $u(\theta_1, \ldots, \theta_K) = u_j$ ($j = 0$ or 1) at stage $i$ has the form

    $$\Lambda_{i,j} = \sum_{k=1}^K n_{ki}\{\hat\theta_{k,n_{ki}} \bar X_{k,n_{ki}} - \psi_k(\hat\theta_{k,n_{ki}})\} - \sup_{\theta: u(\theta_1,\ldots,\theta_K)=u_j} \sum_{k=1}^K n_{ki}\{\theta_k \bar X_{k,n_{ki}} - \psi_k(\theta_k)\},$$

in which $n_{ki}$ is the total number of observations from the $k$th population up to stage $i$.
Let $n_i = \sum_{k=1}^K n_{ki}$. As pointed out in Section 3.4 of Lai and Shih (2004), the normal
approximation to the signed-root likelihood ratio statistic is still applicable when
$n_{ki} = p_k n_i + O_p(n_i^{1/2})$, where $p_1, \ldots, p_K$ are nonnegative constants that sum up to 1, as in
random allocation of patients to the $K$ treatments (for which $p_k = 1/K$).

2.2 Mid-course Modification of Maximum Sample Size

We now modify the adaptive designs in the preceding section to accommodate the possibility
of mid-course increase of the maximum sample size from $M$ to $\tilde M$. Let $u_2$ be the alternative
implied by $\tilde M$, so that the level-$\alpha$ GLR test with sample size $\tilde M$ has power $1 - \tilde\alpha$. Note that
$u_1 > u_2 > u_0$. Whereas the sample size $n_3$ is chosen to be $M$ in Section 2.1, we now define

    $$\tilde n(\theta) = \min\Big\{ |\log\alpha| \Big/ \inf_{\lambda: u(\lambda)=u_0} I(\theta, \lambda),\ \ |\log\tilde\alpha| \Big/ \inf_{\lambda: u(\lambda)=u_2} I(\theta, \lambda) \Big\},$$
    $$n_3 = n_2 \vee \{M' \wedge \lceil (1 + \rho_m)\, \tilde n(\hat\theta_{n_2}) \rceil\},$$

where $M < M' \le \tilde M$ and $n_2 = m \vee \{M \wedge \lceil (1 + \rho_m)\, n(\hat\theta_m) \rceil\}$. We can regard the test as a
group sequential test with 4 groups and $n_1 = m$, $n_4 = \tilde M$, but with adaptively chosen $n_2$ and
$n_3$. If the test does not end at the third stage, continue to the fourth and final stage with
sample size $n_4 = \tilde M$. Its rejection and futility boundaries are similar to those in Section 2.1.
Extending our notation $\Lambda_{i,j}$ in (1.1) to $1 \le i \le 4$ and $0 \le j \le 2$, the test stops at stage $i \le 3$
and rejects $H_0$ if

    $$n_i < \tilde M, \quad u(\hat\theta_{n_i}) > u_0, \quad \text{and} \quad \Lambda_{i,0} \ge b, \qquad (2.10)$$

stops and accepts $H_0$ if

    $$n_i < \tilde M, \quad u(\hat\theta_{n_i}) < u_2, \quad \text{and} \quad \Lambda_{i,2} \ge \tilde b, \qquad (2.11)$$

and rejects $H_0$ at stage $i = 3$ or 4 if

    $$n_i = \tilde M, \quad u(\hat\theta_{\tilde M}) > u_0, \quad \text{and} \quad \Lambda_{i,0} \ge c, \qquad (2.12)$$

accepting $H_0$ otherwise. The thresholds $b$, $\tilde b$ and $c$ can be defined by equations similar to
(2.7)-(2.9) to ensure that the overall Type I error probability is $\alpha$. For example, in place of
(2.7),

    $$\sup_{\theta: u(\theta)=u_2} P_\theta\{\text{(2.11) occurs for some } i \le 3\} = \tilde\epsilon\tilde\alpha. \qquad (2.13)$$

The basic idea underlying (2.13) is to control the Type II error probability at $u_2$ so that the
test does not lose much power there in comparison with the GLR test that has sample size
$\tilde M$ (and therefore power $1 - \tilde\alpha$ at $u_2$).
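
The ordering $u_1 > u_2 > u_0$ and the third-stage sample size can be illustrated in the normal-mean case. This is a sketch under the assumptions $X_i \sim N(\theta, 1)$, $u(\theta) = \theta$ and $\theta_0 = 0$, where the FSS GLR test reduces to a one-sided z-test; the function names and the numerical settings in the usage note are hypothetical.

```python
import math
from statistics import NormalDist

def implied_alternative(n_max, alpha, alpha_t):
    """Alternative at which the level-alpha one-sided z-test of theta <= 0,
    based on n_max N(theta, 1) observations, has power 1 - alpha_t."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha) + z(1 - alpha_t)) / math.sqrt(n_max)

def n_rule(theta, theta_alt, alpha, alpha_t):
    """Sample-size function of the form (2.6), with I(theta, lam) = (theta - lam)^2 / 2."""
    i0 = theta ** 2 / 2
    ia = (theta - theta_alt) ** 2 / 2
    return min(abs(math.log(alpha)) / i0 if i0 > 0 else math.inf,
               abs(math.log(alpha_t)) / ia if ia > 0 else math.inf)

def n2_size(xbar_m, m, M, theta1, alpha, alpha_t, rho):
    """Second-stage size, capped at the original maximum M."""
    return max(m, math.ceil(min(M, (1 + rho) * n_rule(xbar_m, theta1, alpha, alpha_t))))

def n3_size(xbar_n2, n2, M_prime, theta2, alpha, alpha_t, rho):
    """Third-stage size, using the alternative theta2 implied by the enlarged
    maximum and capped at M' (M < M' <= M-tilde)."""
    return max(n2, math.ceil(min(M_prime, (1 + rho) * n_rule(xbar_n2, theta2, alpha, alpha_t))))
```

With $M = 100$, $\tilde M = 200$, $\alpha = .05$ and $\tilde\alpha = .2$, `implied_alternative(100, .05, .2)` is about 0.249 and `implied_alternative(200, .05, .2)` is about 0.176, so enlarging the maximum sample size moves the implied alternative closer to the null, as stated in the text.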

2.3 Implementation via Normal Approximation or Monte Carlo 

To begin with, suppose the $X_i$ are $N(\theta, 1)$ and $u(\theta) = \theta$. We write $\theta_j$ instead of $u_j$ and,
without loss of generality, we shall assume that $\theta_0 = 0$. The thresholds $b$, $\tilde b$ and $c$ of the
three-stage test in Section 2.1 can be computed by solving in succession (2.7), (2.8) and
(2.9). Univariate grid search or Brent's method (Press et al., 1992) can be used to solve
each equation. Since $I(\theta, \lambda) = (\theta - \lambda)^2/2$, we can rewrite (2.7) as

    $$P_{\theta_1}\{S_m - m\theta_1 > -(2\tilde b m)^{1/2},\ S_{n_2} - n_2\theta_1 \le -(2\tilde b n_2)^{1/2}\} + P_{\theta_1}\{S_m - m\theta_1 \le -(2\tilde b m)^{1/2}\} = \tilde\epsilon\tilde\alpha,$$

and (2.8) and (2.9) as

    $$P_0\{S_m/(2m)^{1/2} \ge b^{1/2}\} + P_0\{\tilde b_m < S_m/(2m)^{1/2} < b^{1/2},\ S_{n_2}/(2n_2)^{1/2} \ge b^{1/2},\ n_2 < M\} = \epsilon\alpha,$$
    $$P_0\{\tilde b_m < S_m/(2m)^{1/2} < b^{1/2},\ n_2 < M,\ \tilde b_{n_2} < S_{n_2}/(2n_2)^{1/2} < b^{1/2},\ S_M/(2M)^{1/2} \ge c^{1/2}\}$$
    $$\quad + P_0\{\tilde b_m < S_m/(2m)^{1/2} < b^{1/2},\ n_2 = M,\ S_M/(2M)^{1/2} \ge c^{1/2}\} = (1 - \epsilon)\alpha,$$

where $\tilde b_n = \theta_1 (n/2)^{1/2} - \tilde b^{1/2}$ denotes the futility boundary for $S_n/(2n)^{1/2}$ implied by (2.3).
The probabilities involving $n_2$ can be computed by conditioning on the value $x$ of $S_m/m$, which
completely determines the value of $n_2$, denoted by $k(x)$. For example, the probabilities under
$\theta = 0$ can be computed via

    $$P_0\{S_{n_2} \ge (2bn_2)^{1/2} \mid S_m = mx\} = P\{N(0,1) \ge [(2bk(x))^{1/2} - mx]/[k(x) - m]^{1/2}\}, \qquad (2.14)$$
    $$P_0\{S_{n_2} \in dy,\ S_M \in dz \mid S_m = mx\} = \varphi_{k(x)-m}(y - mx)\,\varphi_{M-k(x)}(z - y)\,dy\,dz, \qquad (2.15)$$

where $\varphi_v$ is the $N(0, v)$ density function, i.e., $\varphi_v(w) = (2\pi v)^{-1/2}\exp(-w^2/2v)$. The
probabilities under $\theta_1$ can be computed similarly. Hence standard recursive numerical integration
algorithms can be used to compute the probabilities in (2.7)-(2.9); see Jennison and Turnbull
(2000, Section 19.2). For the general multiparameter exponential family, this method can be
used to compute the thresholds $b$, $\tilde b$ and $c$ for (2.2)-(2.4), since the problem can be
approximated by that of testing a normal mean, as discussed in Section 2.1.
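
As an illustration of the conditioning argument behind (2.14), the sketch below evaluates the left-hand side of (2.8) under $\theta = 0$ by one-dimensional midpoint integration over $S_m$ and then solves for $b$ by bisection. It is a simplified stand-in for the recursive numerical integration cited in the text, not the authors' implementation: plain bisection replaces Brent's method, and all function names, the grid size and the example settings in the usage note are assumptions.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def k_of(x, m, M, theta1, alpha, alpha_t, rho):
    """k(x): second-stage size (2.5) determined by the first-stage mean x."""
    i0, i1 = x * x / 2, (x - theta1) ** 2 / 2
    n = min(abs(math.log(alpha)) / i0 if i0 > 0 else math.inf,
            abs(math.log(alpha_t)) / i1 if i1 > 0 else math.inf)
    return max(m, math.ceil(min(M, (1 + rho) * n)))

def cross_prob(b, m, k, x):
    """(2.14): P_0{S_k >= (2 b k)^(1/2) | S_m = m x}, for k > m."""
    return 1 - STD.cdf((math.sqrt(2 * b * k) - m * x) / math.sqrt(k - m))

def early_reject_prob(b, b_t, m, M, theta1, alpha, alpha_t, rho, grid=2000):
    """Left side of (2.8) under theta = 0, integrating over s = S_m."""
    upper = math.sqrt(2 * b * m)                   # stage-1 rejection boundary for S_m
    lower = m * theta1 - math.sqrt(2 * b_t * m)    # stage-1 futility boundary for S_m
    total = 1 - STD.cdf(upper / math.sqrt(m))      # rejection at stage 1
    if lower >= upper:
        return total
    lower = max(lower, -8 * math.sqrt(m))          # truncate the negligible far tail
    h = (upper - lower) / grid
    dens = NormalDist(0, math.sqrt(m))             # distribution of S_m under theta = 0
    for i in range(grid):
        s = lower + (i + 0.5) * h
        k = k_of(s / m, m, M, theta1, alpha, alpha_t, rho)
        if m < k < M:                              # (2.2) applies only while n_2 < M
            total += dens.pdf(s) * cross_prob(b, m, k, s / m) * h
    return total

def solve_b(target, b_t, m, M, theta1, alpha, alpha_t, rho, lo=1e-3, hi=30.0):
    """Bisection for b; the probability decreases as b increases."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if early_reject_prob(mid, b_t, m, M, theta1, alpha, alpha_t, rho) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, with $m = 25$, $M = 100$, $\theta_1 = 0.25$, $\alpha = .05$, $\tilde\alpha = .2$, $\rho_m = .1$ and a previously determined $\tilde b$, calling `solve_b(0.5 * 0.05, b_t, 25, 100, 0.25, 0.05, 0.2, 0.1)` returns the $b$ that spends half of $\alpha$ (that is, $\epsilon = 1/2$) in the first two stages. The threshold $c$ would then follow from the bivariate analogue based on (2.15).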



For mid-course modification of the maximum sample size in Section 2.2, the above
recursive numerical algorithm can be modified to handle the randomness of $n_2$ and $n_3$. The
basic idea is that conditional on $S_m/m = x$, the value of $n_2$ is completely determined as $k(x)$,
and conditional on $S_m/m = x$ and $S_{n_2}/n_2 = y$, the value of $n_3$ is completely determined as
$h(x, y)$. Therefore, analogous to (2.15), we now have

    $$P\{S_{n_3} \in du,\ S_{\tilde M} \in dw \mid S_m/m = x,\ S_{n_2}/n_2 = y\} = \varphi_{h(x,y)-k(x)}(u - yk(x))\,\varphi_{\tilde M - h(x,y)}(w - u)\,du\,dw, \qquad (2.16)$$

and can use bivariate recursive numerical integration. For the general exponential family,
normal approximation to the signed-root likelihood ratio statistic can again be used.

An alternative to normal approximation is to use Monte Carlo similar to that used
in bootstrap tests. While using Monte Carlo simulations to compute error probabilities is
an obvious idea, it is far from clear which distribution from a composite hypothesis
should be chosen to simulate from. Bootstrap theory suggests that we can simulate from the
estimated distribution under the assumed hypothesis, as the GLR statistic is an approximate
pivot under that hypothesis. Since the "estimated distribution" needs data to arrive at the
estimate, we make use of the first-stage data to determine $b$ and $\tilde b$, then we use the
second-stage data to determine $c$ for the three-stage test in Section 2.1. Specifically, the Monte Carlo
method to determine $b$, $\tilde b$ and $c$ proceeds as follows. At the end of the first stage, compute
the maximum likelihood estimate $\hat\theta_{m,j}$ under the constraint $u(\theta) = u_j$, $j = 0, 1$. Determine
$\tilde b$, $b$ and $c$ successively by solving

    $$P_{\hat\theta_{m,1}}\{\text{(2.3) occurs for } i = 1 \text{ or } 2\} = \tilde\epsilon\tilde\alpha, \qquad (2.17)$$
    $$P_{\hat\theta_{m,0}}\{\text{(2.3) does not occur for } i \le 2, \text{ and (2.2) occurs for } i = 1 \text{ or } 2\} = \epsilon\alpha, \qquad (2.18)$$
    $$P_{\hat\theta_{n_2,0}}\{\text{(2.2) and (2.3) do not occur for } i \le 2, \text{ and (2.4) occurs}\} = (1 - \epsilon)\alpha, \qquad (2.19)$$

noting that $c$ does not have to be determined until after the second stage, when $n_2$ observations
become available for the updated estimate $\hat\theta_{n_2,0}$. The probabilities in (2.17)-(2.19), with
$\theta = \hat\theta_{m,j}$ or $\theta = \hat\theta_{n_2,0}$ as indicated, can be computed by Monte Carlo simulations. Similarly,
to determine the thresholds $b$, $\tilde b$ and $c$ of the adaptive test in Section 2.2, we can use Monte
Carlo simulations instead of normal approximation and numerical integration to compute
the corresponding probabilities.
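
A minimal Monte Carlo sketch of the probability on the left of (2.17) is given below for the normal-mean case, in which the constrained estimate under $u(\theta) = \theta_1$ is simply $\theta_1$ itself. The setting ($X_i \sim N(\theta, 1)$, $\theta_0 = 0$), the function names, the seed and the replication count are assumptions made for illustration; the paper's own simulations are not reproduced here.

```python
import math
import random

def k_of(x, m, M, theta1, alpha, alpha_t, rho):
    """Second-stage size (2.5) as a function of the first-stage mean x."""
    i0, i1 = x * x / 2, (x - theta1) ** 2 / 2
    n = min(abs(math.log(alpha)) / i0 if i0 > 0 else math.inf,
            abs(math.log(alpha_t)) / i1 if i1 > 0 else math.inf)
    return max(m, math.ceil(min(M, (1 + rho) * n)))

def futility_prob_mc(b_t, theta, m, M, theta1, alpha, alpha_t, rho,
                     reps=20000, seed=1):
    """Estimate P_theta{(2.3) occurs at stage 1 or 2} by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Both normal draws are taken up front so that runs with different
        # b_t (same seed) share the same random numbers.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        s_m = m * theta + math.sqrt(m) * z1
        n2 = k_of(s_m / m, m, M, theta1, alpha, alpha_t, rho)
        s_n2 = s_m + (n2 - m) * theta + math.sqrt(n2 - m) * z2
        stop1 = s_m - m * theta1 <= -math.sqrt(2 * b_t * m)
        stop2 = n2 < M and s_n2 - n2 * theta1 <= -math.sqrt(2 * b_t * n2)
        hits += 1 if (stop1 or stop2) else 0
    return hits / reps
```

Because the same seed yields the same draws, the estimate is nonincreasing in $\tilde b$, so a simple bisection or grid search over `b_t` can be used to match the target $\tilde\epsilon\tilde\alpha$ up to Monte Carlo error.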



3 APPLICATIONS AND NUMERICAL EXAMPLES



3.1 Application to Single-Arm Phase II Cancer Trials 



As pointed out by Vickers et al. (2007, p. 927), in a typical phase II study of a novel cancer
treatment, "a cohort of patients is treated, and the outcomes are related to the prespecified
target or bar. If the results meet or exceed the target, the treatment is declared worthy of
further study; otherwise, further development is stopped. This has been referred to as the
'go/no go' decision. Most often, the outcome specified is a measure of tumor response, e.g.,
complete or partial response using Response Evaluation Criteria in Solid Tumors, expressed
as a proportion of the total number of patients. Response can also be defined in terms of
the proportion who have not progressed or who are alive at a predetermined time (e.g., one
year) after treatment is started." The most widely used designs for these single-arm phase II
trials are Simon's (1989) optimal 2-stage designs, which allow early stopping of the trial if
the treatment has not shown a beneficial effect, as measured by a Bernoulli proportion.
These designs are optimal in the sense of minimizing the expected sample size under the null
hypothesis of no viable treatment effect, subject to Type I and II error probability bounds.
Given a maximum sample size $M$, Simon considered the design that stops for futility
after $m < M$ patients if the number of patients exhibiting positive treatment effect is $r_1$ ($< m$)
or fewer, and otherwise treats an additional $M - m$ patients and rejects the treatment if
and only if the number of patients exhibiting positive treatment effect is $r_2$ ($< M$) or fewer.
Simon's designs require that a null proportion $p_0$, representing some "uninteresting" level
of positive treatment effect, and an alternative $p_1 > p_0$ be specified. The null hypothesis is
$H_0: p \le p_0$, where $p$ denotes the probability of positive treatment effect. The Type I and II
error probabilities $P_{p_0}\{\text{Reject } H_0\}$, $P_{p_1}\{\text{Accept } H_0\}$ and the expected sample size $E_{p_0}N$ can
be computed for any design of this form, which can be represented by the parameter vector
$(m, M, r_1, r_2)$. Using computer search over these integer-valued parameters, Simon (1989)
tabulated the optimal designs in his Tables 1 and 2 for different values of $(p_0, p_1)$.
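
The operating characteristics of a two-stage design $(m, M, r_1, r_2)$ follow directly from binomial probabilities. The sketch below is an illustrative computation with hypothetical function names; here "reject $H_0$" means that the treatment is declared promising, i.e., the total number of responses exceeds $r_2$ after the trial has continued past the first stage.

```python
from math import comb

def binom_pmf(k, n, p):
    """P{Bin(n, p) = k}."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def simon_oc(m, M, r1, r2, p):
    """Return (P_p{reject H_0}, E_p N) for the two-stage design (m, M, r1, r2)."""
    # Probability of early termination for futility: at most r1 responses in stage 1.
    pet = sum(binom_pmf(k, m, p) for k in range(0, r1 + 1))
    reject = 0.0
    for x1 in range(r1 + 1, m + 1):
        # Second-stage responses needed for the total to exceed r2.
        need = max(0, r2 + 1 - x1)
        tail = sum(binom_pmf(k, M - m, p) for k in range(need, M - m + 1))
        reject += binom_pmf(x1, m, p) * tail
    expected_n = m + (1 - pet) * (M - m)
    return reject, expected_n
```

For the design $(m, M, r_1, r_2) = (10, 29, 1, 5)$ with $p_0 = .1$, `simon_oc(10, 29, 1, 5, 0.1)` gives a Type I error probability of about .047 and an expected sample size of about 15.0, in agreement with the Sim2 entries at $p_0$ in Table 2.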

Whether the new treatment is declared promising in a phase II trial depends strongly
on the prescribed $p_0$ and $p_1$. In their systematic review of 134 papers reporting phase II
trials in J. Clin. Oncology, Vickers et al. (2007) found 70 papers referring to historical data
for their choice of the null or alternative response rate, and that nearly half (i.e., 32) of these
papers did not cite the source of the historical data used, while only 9 clearly gave a single
historical estimate of their choice of $p_0$. Moreover, no study "incorporated any statistical
method to account for the possibility of sampling error or for differences in case mix between
the phase II sample and the historical cohort." The adaptive designs in Section 2.1 applied
to this setting do not require the specification of the alternative $p_1$, a desirable property to
prevent well-intentioned but misguided practitioners from choosing $p_1$ artificially small to
inflate the appearance of a positive treatment effect, if one exists; uncertainty in the choice
of $p_0$ is also an important issue and is addressed in Section 3.2. For now, assume $p_0$ to be
given along with initial and maximum sample sizes $m$ and $M$. The adaptive test takes $p_1$
to be the alternative where the FSS test, with Type I error probability $\alpha$ at $p_0$, has power
$1 - \beta$, i.e., the solution of $F_{M,p_1}(F^{-1}_{M,p_0}(1 - \alpha)) = \beta$, where $F_{M,p}$ is the distribution function
of the Bin($M$, $p$) distribution. The GLR statistic (1.1) at the $i$th stage is

    $$\Lambda_{i,j} = n_i\Big[\hat p_{n_i}\log\Big(\frac{\hat p_{n_i}}{p_j}\Big) + (1 - \hat p_{n_i})\log\Big(\frac{1 - \hat p_{n_i}}{1 - p_j}\Big)\Big]$$

for $j = 0$ or 1, and the critical values $b$, $\tilde b$ and $c$ are chosen to satisfy (2.7)-(2.9). Because of the
discreteness of the binomial distribution it may be impossible to satisfy (2.7)-(2.9) exactly,
in which case (2.7) is satisfied approximately and (2.8)-(2.9) are satisfied conservatively. The
stopping rule defined by (2.2)-(2.4) may alternatively be stated in terms of the number of
cumulative successes $S_{n_i}$ at the $i$th stage. Table 1 describes the adaptive design (denoted by
ADAPT) and Simon's (1989) optimal 2-stage design (denoted by Sim2) for two choices of
$m$, $M$, $\alpha$, $\beta$, $p_0$, and Table 2 contains their operating characteristics, computed exactly using
the Bin($n$, $p$) distribution. ADAPT has expected sample size close to Sim2 for $p$ near $p_0$, and
smaller sample size when $p$ is roughly midway between $p_0$ and $p_1$ or is larger; $p_1 = .3$ in the
top panel of Table 2 and $p_1 = .44$ in the bottom panel. It is not surprising that the expected
sample size of Sim2 increases with $p$, since Sim2 only stops early for futility. The expected
number of stages shows a similar pattern, while their power functions are nearly identical.
Note that even though ADAPT has a maximum of three stages, its expected number of
stages is less than 2 for all $p$ and usually close to 1.

Table 1. Description of ADAPT and Sim2 for two cases

S_m      ADAPT                                                   Sim2 (p_1 = .3 or .44)

(a) m = 10, M = 29, p_0 = .1, α = .05, β = .2
≤ 1      Accept H_0.                                             Accept H_0.
2        n_2 = M; reject H_0 if S_{n_2} ≥ 6.                     n_2 = M and reject H_0 if S_M ≥ 6.
3        n_2 = 20.                                               n_2 = M and reject H_0 if S_M ≥ 6.
         (i) If S_{n_2} ≤ 3, accept H_0.
         (ii) If S_{n_2} ≥ 6, reject H_0.
         (iii) If 4 ≤ S_{n_2} ≤ 5 and S_M ≥ 6, reject H_0.
≥ 4      Reject H_0.                                             n_2 = M; reject H_0 if S_M ≥ 6.

(b) m = 30, M = 82, p_0 = .3, α = β = .1
≤ 8      Accept H_0.                                             Accept H_0.
9        n_2 = 57.                                               Accept H_0.
         (i) If S_{n_2} ≤ 19, accept H_0.
         (ii) If S_{n_2} ≥ 24, reject H_0.
         (iii) If 20 ≤ S_{n_2} ≤ 23 and S_M ≥ 32, reject H_0.
10-13    n_2 = M; reject H_0 if S_{n_2} ≥ 31.                    n_2 = M and reject H_0 if S_M ≥ 30.
≥ 14     Reject H_0.                                             n_2 = M and reject H_0 if S_M ≥ 30.
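
The implied alternative $p_1$ and the binomial GLR statistic used by ADAPT can be sketched as follows. The code solves $F_{M,p_1}(F^{-1}_{M,p_0}(1 - \alpha)) = \beta$ by bisection, taking $F^{-1}$ to be the smallest integer at which the distribution function reaches $1 - \alpha$; that convention, the function names and the tolerance are assumptions of this sketch.

```python
import math
from math import comb

def binom_cdf(x, n, p):
    """F_{n,p}(x) = P{Bin(n, p) <= x}."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(0, x + 1))

def implied_p1(M, p0, alpha, beta, tol=1e-8):
    """Solve F_{M,p1}(x*) = beta, where x* is the (1 - alpha) quantile of Bin(M, p0)."""
    x_star = next(k for k in range(M + 1) if binom_cdf(k, M, p0) >= 1 - alpha)
    lo, hi = p0, 1.0 - 1e-12          # F_{M,p}(x*) decreases as p increases
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(x_star, M, mid) > beta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def glr_binomial(n, successes, pj):
    """Lambda = n [ phat log(phat/pj) + (1 - phat) log((1 - phat)/(1 - pj)) ],
    with the convention 0 log 0 = 0."""
    phat = successes / n
    def term(a, b):
        return a * math.log(a / b) if a > 0 else 0.0
    return n * (term(phat, pj) + term(1 - phat, 1 - pj))
```

With $M = 29$, $p_0 = .1$, $\alpha = .05$ and $\beta = .2$, `implied_p1(29, 0.1, 0.05, 0.2)` returns a value just under .30, consistent with $p_1 = .3$ in the top panel of Table 2.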
Table 2. Expected sample size, power (in parentheses) and expected number of stages (in
brackets) of phase II designs

p            ADAPT                       Sim2

(a) m = 10, M = 29, p_0 = .1, α = .05, β = .2
.05          11.6   (.3%)     [1.1]      11.6   (.2%)     [1.1]
p_0 = .1     14.5   (5%)      [1.3]      15.0   (4.7%)    [1.3]
.2           18.8   (43.3%)   [1.6]      21.9   (43.1%)   [1.6]
p_1 = .3     18.1   (79.4%)   [1.6]      26.1   (79.6%)   [1.8]
.4           14.8   (94.9%)   [1.4]      28.1   (95.0%)   [2.0]
.5           12.1   (98.9%)   [1.2]      28.8   (98.9%)   [2.0]
.6           10.1   (99.9%)   [1.0]      29.0   (99.9%)   [2.0]

(b) m = 30, M = 82, p_0 = .3, α = β = .1
.2           34.9   (.3%)     [1.1]      33.2   (.03%)    [1.1]
p_0 = .3     51.8   (10.0%)   [1.5]      51.4   (10.0%)   [1.4]
.35          60.4   (35.0%)   [1.7]      63.4   (36.2%)   [1.6]
p_1 = .44    52.9   (88.7%)   [1.5]      77.7   (87.8%)   [1.9]
.5           42.4   (98.4%)   [1.3]      80.9   (97.5%)   [2.0]
.6           31.9   (99.9%)   [1.0]      82.0   (99.9%)   [2.0]




3.2 Extension to Randomized Phase II Cancer Trials 



As noted by Vickers et al. (2007), uncertainty in the choice of $p_0$ and $p_1$ can increase the
likelihood that (a) a treatment with no viable positive treatment effect proceeds to phase III,
or (b) a treatment with positive treatment effect is abandoned at phase II. To circumvent the
problem of choosing an artificially small $p_0$, either intentionally by a practitioner wanting to
give the treatment the "best chance" of showing a positive effect if one exists, or
unintentionally because of inaccurate information about the control, Ellenberg and Eisenberger (1985)
proposed to perform a controlled 2-arm phase II trial in which patients are randomized into
both treatment and control groups. After randomizing $2n_1$ patients into treatment and
control arms, the trial is stopped for futility when the number in the treatment group showing
positive effect is not greater than in the control group. Otherwise, the trial continues until a
total of $2n_2$ patients have been randomized into the study and then a standard fixed sample
binomial test is performed. Letting $p$ and $q$ denote the probability of positive effect in the
treatment and control groups, respectively, Thall et al. (1988) subsequently chose $n_1$, $n_2$ and
two other design parameters $y_1$ and $y_2$, described below, to minimize the "average" expected
sample size

    $$\mathrm{AvSS} = \tfrac{1}{2}[E(N \mid p = q) + E(N \mid p = q + \delta)] \qquad (3.1)$$

subject to Type I and II error probability constraints $\alpha$ and $\beta$. The two-stage test of
$H_0: p \le q$ stops for futility after the first stage if an approximately normally distributed
test statistic $Z_1$, based on the 2-population binomial data, is no greater than $y_1$, otherwise
continuing with a second stage and rejecting $H_0$ if $Z_2 > y_2$. Because the expectations in
(3.1) depend on $p$ and $q$, both must be specified to calculate the trial design.

The adaptive designs of Section 2 apply naturally to this 2-arm setting. The GLR
statistic (1.1) at the $i$th stage is

$$n_i\left[\hat p_i \log\frac{\hat p_i}{\tilde p_{i,j}} + (1-\hat p_i)\log\frac{1-\hat p_i}{1-\tilde p_{i,j}} + \hat q_i \log\frac{\hat q_i}{\tilde p_{i,j}-\delta_j} + (1-\hat q_i)\log\frac{1-\hat q_i}{1-\tilde p_{i,j}+\delta_j}\right],$$

where $\hat p_i$, $\hat q_i$ are the fraction of successes in the treatment and control arms and $\tilde p_{i,j}$ is the
MLE of $p$ under $p - q = \delta_j$, $j = 0, 1$, where $\delta_0 = 0$ and $\delta_1$ is the implied alternative (see the
example below). The boundaries $b$, $\tilde b$ and $c$ can be computed using a normal approximation
or Monte Carlo simulations as described in Section 2.3; the latter, with 1 million simulations
for each probability calculation, is used in the following comparative study.
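
The two-arm GLR statistic above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names (`xlogy`, `loglik`, `constrained_mle`, `glr_statistic`) are hypothetical, equal per-arm sample sizes are assumed, and the constrained MLE is found by a simple ternary search, which is valid because the binomial log-likelihood is concave in $p$ along the line $q = p - \delta$.

```python
import math


def xlogy(x, y):
    """Return x*log(y) with the convention 0*log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def loglik(x, y, n, p, q):
    """Binomial log-likelihood (up to a constant) for x and y successes out of n per arm."""
    return xlogy(x, p) + xlogy(n - x, 1 - p) + xlogy(y, q) + xlogy(n - y, 1 - q)


def constrained_mle(x, y, n, delta, iters=200):
    """Maximize the log-likelihood over p subject to q = p - delta (ternary search)."""
    eps = 1e-12
    lo, hi = max(0.0, delta) + eps, min(1.0, 1.0 + delta) - eps
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(x, y, n, m1, m1 - delta) < loglik(x, y, n, m2, m2 - delta):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def glr_statistic(x, y, n, delta):
    """Log GLR for testing p - q = delta: unrestricted minus restricted maximum."""
    p_hat, q_hat = x / n, y / n
    p_til = constrained_mle(x, y, n, delta)
    return loglik(x, y, n, p_hat, q_hat) - loglik(x, y, n, p_til, p_til - delta)
```

For example, `glr_statistic(20, 10, 25, 0.0)` evaluates the statistic at $\delta_0 = 0$ for 20 versus 10 successes out of 25 per arm; the difference of maximized log-likelihoods is algebraically the same as the bracketed sum of logarithms multiplied by $n_i$.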



Thall and Simon (1994) describe a phase II trial of fludarabine + ara-C + granulocyte
colony stimulating factor (G-CSF) for treatment of acute myelogenous leukemia (AML).
Although this trial was designed using other methods described in their paper, we use it
here as a real setting to compare the adaptive design with Thall et al.'s (1988) design. The
standard therapy for AML at the time was fludarabine + ara-C alone, and in a preceding
study, 22 out of 45 patients achieved complete remission of the leukemia, the clinical endpoint
of interest, suggesting an initial estimate of $q$ of .5. The study was conducted to detect an
increase in remission rate of $\delta = .2$. For $\alpha = .05$ and $\beta = .2$, Thall et al.'s (1988) optimal
2-stage design (denoted by Opt2) for detecting a 20% improvement when the control remission
rate $q$ is .5 has first stage of 33 per arm, followed by a second stage of 45 per arm if $Z_1 >
y_1 = .356$, and rejecting the null hypothesis after the second stage if $Z_2 > y_2 = 1.584$. The
adaptive design (denoted by ADAPT) with first stage $m = 25$ and maximum sample 78 per
arm, the same as Opt2, uses boundaries $b = 2.12$, $\tilde b = 1.03$ and $c = 1.56$ for $\alpha = .05$, $\tilde\alpha = .2$
and $\varepsilon = \tilde\varepsilon = 1/2$. Table 3 contains the operating characteristics of ADAPT and Opt2 for
a variety of treatment and control response rates $(p, q)$ around (.7, .5). Each result is based
on 100,000 simulations. ADAPT has substantially smaller expected sample size than Opt2.
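
The expected sample size of the Opt2 design just described, and the criterion (3.1), can be checked by exact enumeration of the first-stage binomial outcomes. The sketch below is only illustrative: the exact form of $Z_1$ is not given in the text, so a pooled-variance two-sample z statistic is assumed, and the function names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def z_stat(x, y, n):
    """Pooled-variance two-sample z statistic (an assumed form of Z_1)."""
    pbar = (x + y) / (2 * n)
    if pbar in (0.0, 1.0):
        return 0.0
    return (x - y) / math.sqrt(2 * n * pbar * (1 - pbar))


def prob_continue(p, q, n1, y1):
    """Exact probability that Z_1 > y1, i.e. that the trial proceeds to stage 2."""
    total = 0.0
    for x in range(n1 + 1):
        px = binom_pmf(x, n1, p)
        for y in range(n1 + 1):
            if z_stat(x, y, n1) > y1:
                total += px * binom_pmf(y, n1, q)
    return total


def expected_n(p, q, n1=33, n2_extra=45, y1=0.356):
    """Per-arm expected sample size of a two-stage futility-only design."""
    return n1 + n2_extra * prob_continue(p, q, n1, y1)


def avss(q, delta, **design):
    """Average expected sample size (3.1)."""
    return 0.5 * (expected_n(q, q, **design) + expected_n(q + delta, q, **design))
```

Calling `expected_n(0.5, 0.5)`, `expected_n(0.7, 0.5)` and `avss(0.5, 0.2)` gives values that can be compared with the Opt2 entries 49.0, 73.7 and 61.4 reported for $q = .5$ in Table 3.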
This is in part because Opt2 only stops early for futility. Even though the parameters of Opt2
in this case are chosen to minimize (3.1), ADAPT shows substantial savings both when $p = q$
and when $p = q + \delta$. ADAPT and Opt2 have similar expected number of stages near the null
hypothesis, with ADAPT decreasing as $p - q$ increases while Opt2 steadily increases to 2,
again due to its early stopping only for futility. The power functions of the tests are similar,
with Opt2 having slightly higher power. Note that the Type I error probability of Opt2
is inflated above $\alpha = .05$ at $p = q = .5$ due to the normal approximations used to
compute the design parameters.

Table 3. Expected sample size, power (in parentheses), expected number of stages (in
brackets) and average expected sample size (3.1) (denoted by AvSS) of 2-arm phase II
designs

 q    p      ADAPT                  Opt2
 .4   .3     33.3 (0.4%)  [1.1]     37.8 (0.2%)  [1.1]
      .4     46.1 (5.3%)  [1.5]     48.9 (5.3%)  [1.4]
      .5     57.5 (32.3%) [1.8]     63.3 (35.6%) [1.7]
      .6     56.4 (76.0%) [1.8]     73.5 (78.9%) [1.9]
      .7     43.8 (97.0%) [1.5]     77.3 (97.7%) [2.0]
      AvSS   51.3                   61.2
 .5   .4     34.7 (0.4%)  [1.2]     38.2 (0.2%)  [1.1]
      .5     47.3 (5.0%)  [1.5]     49.0 (5.6%)  [1.4]
      .6     57.5 (32.2%) [1.8]     63.3 (35.5%) [1.7]
      .7     55.1 (77.8%) [1.8]     73.7 (80.4%) [1.9]
      .8     41.0 (97.6%) [1.4]     77.5 (98.2%) [2.0]
      AvSS   51.2                   61.4
 .6   .5     34.7 (0.4%)  [1.2]     38.2 (0.2%)  [1.1]
      .6     46.0 (5.2%)  [1.5]     48.9 (5.3%)  [1.4]
      .7     55.8 (33.2%) [1.7]     63.3 (35.6%) [1.7]
      .8     52.3 (81.1%) [1.7]     74.4 (84.2%) [1.9]
      .9     35.9 (98.5%) [1.3]     77.8 (99.4%) [2.0]
      AvSS   49.2                   61.7

3.3 Comparison with Adaptive Tests for Difference of Means with Unknown Variances



Let $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ be independent normal observations with unknown means
$\mu_X, \mu_Y$ and variances $\sigma_X^2, \sigma_Y^2$, respectively. Even if the variances $\sigma_X^2, \sigma_Y^2$ are assumed equal,
no fixed sample size test of the hypothesis $\mu_X \le \mu_Y$ can achieve specified error probabilities
$\alpha$, $\tilde\alpha$ at $\mu_X - \mu_Y$ equal to 0 and some specified $\delta > 0$ without knowing the true value of the
variance, as demonstrated by Dantzig (1940). To overcome this difficulty in the case of a
single normal mean, Stein (1945) proposed a two-stage procedure that uses the first-stage
sample to estimate the variance. The total sample size is then determined as a function of
this estimate so that the test statistic, which uses the data from both stages to estimate
the mean but only the first stage to estimate the variance, is exactly $t$-distributed under
$\mu_X - \mu_Y = 0$, at which the test has Type I error $\alpha$; the power of the test at $\mu_X - \mu_Y = \delta$ is
strictly larger than $1 - \tilde\alpha$. Stein's procedure can be easily extended to the mean difference
problem if $\sigma_X = \sigma_Y$, and has been modified for clinical trials by Wittes and Brittain (1990),
Birkett and Day (1994), and Denne and Jennison (1999). These modifications of Stein's
procedure use the overall sample to estimate both the mean difference and the common
variance in modifying Stein's test statistic, and therefore may not maintain the Type I error
probability at the prescribed level. Wittes and Brittain (1990) also assume a prior estimate
$\sigma_0^2$ (e.g., from previous studies in the literature) of the common variance to modify Stein's
formula for the total sample size.

In Table 4 we compare the performance of the adaptive test in Section 2.1 (denoted by
ADAPT) with Stein's test, denoted by S, and the modified versions of Wittes and Brittain
(1990, denoted by WB), Birkett and Day (1994, denoted by BD) and Denne and Jennison
(1999, denoted by DJ) in the context of a phase II hypercholesterolaemia treatment efficacy
trial described by Facey (1992). In this trial, patients were randomized into treatment and
placebo groups and serum cholesterol level reductions, $X_i$ and $Y_i$, assumed to be normally
distributed, were measured after four weeks of treatment. A difference in reductions of serum
cholesterol levels, in mmol/liter, between the treatment and placebo groups of 1.2 was of
clinical interest. Based on previous studies, it was anticipated that the standard deviation
of the reductions would be about 0.7 for both groups. If the standard deviation were known
to be $\sigma_0 = 0.7$, the size of the fixed sample $t$-test with error probabilities $\alpha = \tilde\alpha = .05$ at
mean difference 0 and $\delta = 1.2$ is 9 per group. Following Denne and Jennison (1999), we
assume a first-stage per-group sample size of $m = 5$, being approximately half of 9. If the
standard deviation were in fact $2\sigma_0 = 1.4$, the per-group sample size of the same $t$-test is
31, which we take as a reasonable maximum sample size $M$ for our three-stage test with
$\rho_m = .1$ and $\varepsilon = \tilde\varepsilon = 1/3$. Table 4 contains the power and per-group expected sample
size of ADAPT and the aforementioned procedures in the literature, evaluated by 100,000
simulations at various values of $\mu_X - \mu_Y \in [0, \delta]$ and $\sigma = \sigma_X = \sigma_Y$. Whereas the Stein-type
tests require this assumption of equal variances, the three-stage tests defined in Section 2.1
do not, so for comparison we also include in Table 4 the three-stage test that does not
assume $\sigma_X = \sigma_Y$, which we denote by ADAPT$_{\ne}$. The Kullback-Leibler information number
$I = \min\{I((\mu_X, \mu_Y, \sigma^2), (\tilde\mu_X, \tilde\mu_Y, \tilde\sigma^2)) : \tilde\mu_X - \tilde\mu_Y = 0\}$, where $I(\theta, \tilde\theta)$ is defined in (1.2), is also
reported in the first column. When the true standard deviations $\sigma_X$ and $\sigma_Y$ are equal to the
specified value $\sigma_0$, ADAPT and ADAPT$_{\ne}$ have similar power but smaller expected sample
size than the other tests for values of $\mu_X - \mu_Y$ near 0 and $\delta$. When the standard deviations
$\sigma_X$ and $\sigma_Y$ are larger than the specified value $\sigma_0$, the adaptive tests have much smaller
expected sample sizes than the Stein-type tests, whose second-stage sample size increases
without bound as a function of the first-stage sample variance; in particular, see the last
three rows of Table 4.
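
The dependence of the fixed sample size on the unknown standard deviation can be illustrated with the usual normal-approximation formula $n = 2\sigma^2 (z_\alpha + z_{\tilde\alpha})^2 / \delta^2$ per group. This is a rough sketch rather than the $t$-test calculation used in the text, so it should come out slightly below the quoted sizes of 9 and 31; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def per_group_size(sigma, delta, alpha=0.05, alpha_tilde=0.05):
    """Normal-approximation per-group size for a one-sided two-sample test.

    Assumes a known common standard deviation sigma; the t-test sizes quoted
    in the text are a little larger because the variance must be estimated.
    """
    z = NormalDist().inv_cdf
    n = 2 * sigma**2 * (z(1 - alpha) + z(1 - alpha_tilde)) ** 2 / delta**2
    return math.ceil(n)


# Doubling sigma roughly quadruples the required sample size.
n_small = per_group_size(0.7, 1.2)
n_large = per_group_size(1.4, 1.2)
```

Because the required size scales with $\sigma^2$, a misspecified $\sigma_0$ translates directly into an underpowered or oversized fixed sample design, which is the difficulty the two-stage procedures above try to address.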

An alternative approach to Stein-type designs has been used by Proschan and Hunsberger
(1995) and Li et al. (2002), who simply replace the $\sigma^2$ in their two-stage tests that assume
known variance with its current estimate at each stage. To compare ADAPT with these
tests, which rely on stable variance estimates, we allow a larger first-stage sample size of
$m = 20$. Table 5 contains the power and per-group expected sample size of Proschan &
Hunsberger's test (denoted by PH), two choices of the early stopping boundaries $(h, k)$ in
Table 1 of Li et al. (2002) for their test, which we denote by L1 and L2, and our three-stage
test ADAPT, for various values of $\mu_X - \mu_Y$ and $\sigma$, each entry being the result of 100,000
replications. To compare these tests on equal footing we have chosen the maximum sample
size $M = 121$ for ADAPT because this is the maximum sample size of L1 and is quite close to
the maximum sample sizes of PH and L2, which are 122 and 104, respectively. The PH, L1
and L2 tests are designed to achieve Type I error probability .05 and they choose the sample
size of their second stage based on a conditional power level of 80%. The threshold values
$b = 2.68$, $\tilde b = 1.75$, $c = 1.75$ used by ADAPT are thus computed using $\alpha = .05$, $\tilde\alpha = .20$. The
results in Table 5 show that the true power of L1, L2 and PH falls well below their nominal
conditional power level of 80%. When $\sigma = 2$, the L1, L2 and PH tests have power less
than 50% for all values of $\mu_X - \mu_Y$ considered, which is caused by stopping prematurely for
futility at the end of the first stage; see in particular the rows in Table 5 that correspond to
$\mu_X - \mu_Y = 0$. Since the conditional power criterion is not valid when the estimated difference
of means is near zero, the L and PH tests must stop for futility when this occurs even though
the true difference of means may be substantially greater than zero.

Table 4. Power (first entry) and per-group expected sample size (second entry) of tests of $H_0 : \mu_X \le \mu_Y$

(μ_X - μ_Y, σ)   I      S              WB             BD             DJ             ADAPT          ADAPT≠
(0, σ0/2)        0      5.0% / 5.0     4.9% / 9.0     5.0% / 5.0     5.0% / 5.0     1.6% / 5.0     1.8% / 5.0
(0, σ0)          0      5.0% / 10.1    5.4% / 10.1    6.0% / 10.1    5.6% / 10.3    4.0% / 9.4     4.1% / 10.2
(δ/2, σ0)        .169   53.0% / 10.1   59.5% / 10.1   55.4% / 10.1   57.3% / 10.3   65.0% / 15.5   68.1% / 13.7
(δ, σ0)          .551   96.6% / 10.1   98.0% / 10.1   95.8% / 10.1   96.4% / 10.3   97.8% / 9.4    98.5% / 8.0
(0, 2σ0)         0      5.0% / 38.2    5.5% / 38.2    5.5% / 38.2    4.6% / 30.7    5.0% / 22.1    5.3% / 22.7
(δ/2, 2σ0)       .045   50.3% / 38.1   50.0% / 38.2   49.7% / 38.2   53.8% / 30.7   44.0% / 25.5   44.5% / 26.0
(δ, 2σ0)         .169   95.2% / 38.1   89.1% / 38.2   91.3% / 38.2   92.9% / 30.7   91.9% / 22.1   91.4% / 20.2
(0, 3σ0)         0      5.0% / 85.2    5.3% / 85.2    5.3% / 85.3    4.6% / 67.7    5.1% / 26.3    5.2% / 26.7
(0, 5σ0)         0      5.0% / 236     5.4% / 235     5.4% / 236     4.7% / 186     5.2% / 27.6    5.1% / 28.6
(0, 10σ0)        0      5.0% / 942     5.4% / 940     3.8% / 943     4.9% / 754     5.1% / 28.7    5.1% / 29.3

Table 5. Maximum sample size M, power (first entry) and per-group expected sample size (second entry) for the tests
L1 and L2 of Li et al., Proschan & Hunsberger (PH), and ADAPT

(μ_X - μ_Y, σ)   I      L1             L2             PH             ADAPT
(0, 1)           0      5.5% / 26.3    5.3% / 25.5    5.5% / 25.9    4.8% / 56.5
(0, 2)           0      5.3% / 26.2    5.3% / 26.3    5.4% / 25.9    5.4% / 93.5
(1/4, 1)         .016   29.9% / 32.8   29.3% / 31.0   29.0% / 31.6   48.3% / 76.1
(3/8, 1)         .035   49.5% / 34.5   48.7% / 32.7   48.3% / 33.1   77.5% / 73.5
(1/2, 1)         .061   67.8% / 34.3   66.4% / 32.8   66.4% / 32.7   92.8% / 63.3
(1/2, 2)         .016   12.0% / 29.1   29.9% / 32.9   28.9% / 31.7   56.1% / 98.7
(3/4, 2)         .035   49.8% / 34.5   48.5% / 32.7   48.1% / 33.2   85.6% / 87.0

3.4 Comparison of Tests Allowing Mid-course Modification of Maximum Sample Size



As pointed out in Section 1, Cui et al. (1999) have proposed a method to modify the group
size of a given group sequential test of $H_0 : \theta \le 0$ in response to protocol amendments during
interim analyses. In the example considered by Cui et al. (1999, p. 854), the maximum
sample size is initially $M = 125$ for detecting $\theta_1 = .29$ with power .9 and $\alpha = .025$, but can
be subsequently increased up to $\tilde M = 500$; their sample sizes are twice as large because they
consider variance 2. They consider modifying the group size at the end of a given stage $L$
if the ratio of conditional power at the observed alternative $\hat\theta_{n_L}$ to the conditional power at
$\theta_1$ is greater than 1 or less than .8, in which case the group size is then modified so that the
new maximum sample size is

$$\tilde M \wedge M(\theta_1/\hat\theta_{n_L})^2. \qquad (3.2)$$

If (3.2) is less than the already sampled $n_L$, error spending can be used to end the trial.
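
The updating rule (3.2) is simple enough to state as a one-line function. The sketch below is illustrative only: the function name is hypothetical, the defaults are the values of the Cui et al. example quoted above, and returning the cap when the estimate is exactly zero is an assumption made here to avoid division by zero.

```python
def cui_new_max(theta_hat, theta1=0.29, M=125, M_tilde=500):
    """New maximum sample size under rule (3.2): min(M_tilde, M * (theta1/theta_hat)^2)."""
    if theta_hat == 0:
        return M_tilde  # assumption: treat a zero estimate as hitting the cap
    return min(M_tilde, M * (theta1 / theta_hat) ** 2)


# The rule depends on the interim estimate only through its square.
same_for_both_signs = cui_new_max(0.2) == cui_new_max(-0.2)
```

An interim estimate equal to $\theta_1$ leaves the maximum sample size at $M$, smaller positive estimates inflate it up to the cap $\tilde M$, and the square in (3.2) means that an estimate and its negative produce the same updated size.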
The crux of this method is that the original critical values can be used for the weighted
test statistic without changing the Type I error probability regardless of how the sample
size is changed. Table 6 compares their proposed adaptive group sequential tests with FSS
tests, standard (non-adaptive) group sequential tests, and the adaptive test described in
Section 2.2. Each result is based on 100,000 simulations. All adaptive tests in Table 6 use the
first-stage sample size $m = 25$, maximum sample size initially $M = 125$ with the possibility
of extension up to $\tilde M = 500$, and Type I error probability not exceeding $\alpha = .025$, matching
the setting considered in Section 2 of Cui et al. (1999). Since the maximum sample size can
vary between $M = 125$ and $\tilde M = 500$, the two relevant implied alternatives are $\theta_1 = .29$,
where FSS$_{125}$ has power $1 - \tilde\alpha = .9$, and $\theta_2 = .15$, where FSS$_{500}$ has power .9. The values of
the user-specified parameters of the tests in Table 6 are summarized below.

INSERT TABLE 6 ON THIS PAGE

ADAPT: The adaptive test, described in Section 2.2, that uses $b = 3.48$, $\tilde b = 2.1$ and
$c = 2.31$ corresponding to $\varepsilon = \tilde\varepsilon = 1/2$, $\rho_m = .1$ and $M' = 250$.

FSS$_{125}$, FSS$_{500}$: The FSS tests having sample sizes $M = 125$ and $\tilde M = 500$, respectively.

OBF$_{SC}$: A one-sided O'Brien-Fleming group sequential test having five groups of size
100 and that uses stochastic curtailment futility stopping ($\gamma = .9$ in Section 10.2
of Jennison and Turnbull (2000)) with reference alternative $\theta_2 = .15$; see below.

$C_4$, $C_5$: Two versions of the adaptive group sequential test of Cui et al. (1999) that
adjusts the group size at the end of the first stage; $C_4$ uses four stages and $C_5$ uses five
stages.

$C_5^{SC}$, $C_5^{PF}$: Two modifications of $C_5$ to allow for futility stopping; $C_5^{SC}$ uses stochastic
curtailment futility stopping ($\gamma = .9$ in Section 10.2 of Jennison and Turnbull (2000))
and $C_5^{PF}$ uses power family futility stopping ($\Delta = 1$ in Section 4.2 of Jennison and Turnbull
(2000)). Both $C_5^{SC}$ and $C_5^{PF}$ use reference alternative $\theta_2 = .15$.

Since OBF$_{SC}$, $C_5^{SC}$ and $C_5^{PF}$ have maximum sample size $\tilde M = 500$, the futility stopping
boundaries of these tests are designed to have power .9 at $\theta_2$. We have also included $C_4$
because our adaptive test uses no more than four stages. The tests are evaluated at the $\theta$
values where FSS$_{125}$ has power .01, .025, .7, .8 and .9, and where FSS$_{500}$ has power .7, .8
and .9.

Even though the C tests have maximum sample size $\tilde M = 500$, they are underpowered
at $0 < \theta \le \theta_2$, the alternative implied by $\tilde M$, when compared with ADAPT, FSS$_{500}$ and
OBF$_{SC}$. In particular, the C tests have power less than .65 at $\theta_2$. Since $C_4$ and $C_5$ use no
futility stopping, this suggests that their updated maximum sample size (3.2) (with $L = 1$)
has contributed to the power loss. The large expected sample sizes of $C_4$ and $C_5$ at $\theta < 0$
reveal another problem with this sample size updating rule: it does not consider the sign of
$\hat\theta_m$; a negative value of $\hat\theta_m$ could result in the same sample size modification as a positive
one, causing a large increase in the group size when it should be decreased toward futility
stopping. ADAPT has only a slight loss of power in comparison with FSS$_{500}$ and the five-stage
OBF$_{SC}$ at $\theta > 0$, with substantially smaller expected sample size. The mean number
of stages of ADAPT at $\theta_1 = .29$ shows that it behaves like a two- or three-stage test there.
OBF$_{SC}$, on the other hand, has the largest expected sample size at $\theta > 0$ of the tests in
Table 6 other than FSS$_{125}$.

4 ASYMPTOTIC THEORY AND A MODIFIED CONDITIONAL POWER TEST



In this section we prove the asymptotic optimality of the adaptive tests in Sections 2.1 and
2.2. The proof also sheds light on how the two-stage conditional power tests, which are shown
to be severely under-powered in Section 3, can substantially increase their power by adding
a third stage. These modified conditional power tests are still not asymptotically efficient
because they try to mimic the optimal FSS test when the alternative is given whereas the
adaptive test of Section 2.1 tries to mimic the SPRT instead, assuming $H_0$ to be simple.

Theorem 4.1. Let $N$ denote the sample size of the three-stage GLR test in Section 2.1, with
$m$, $M$ and $m \vee \{M \wedge \lceil (1 + \rho_m) n(\hat\theta_m) \rceil\}$ being the possible values of $N$. Let $T$ be the sample
size of any test of $H_0 : u(\theta) \le u_0$ versus $H_1 : u(\theta) \ge u_1$, sequential or otherwise, which takes
at least $m$ and at most $M$ observations and whose Type I and Type II error probabilities do
not exceed $\alpha$ and $\tilde\alpha$, respectively. Assume that $\log\alpha \sim \log\tilde\alpha$,

$$m/|\log\alpha| \to a, \quad M/|\log\alpha| \to A, \quad \rho_m \to 0 \ \text{but} \ m^{1/2}\rho_m/(\log m)^{1/2} \to \infty \qquad (4.1)$$

as $\alpha + \tilde\alpha \to 0$, with $0 < a < A$. Then for every fixed $\theta$, as $\alpha + \tilde\alpha \to 0$,

$$E_\theta(N) \sim m \vee \Big(M \wedge |\log\alpha| \Big/ \Big\{\inf_{\lambda : u(\lambda) = u_0} I(\theta, \lambda) \vee \inf_{\lambda : u(\lambda) = u_1} I(\theta, \lambda)\Big\}\Big), \qquad (4.2)$$

$$E_\theta(T) \ge (1 + o(1))\,E_\theta(N). \qquad (4.3)$$

Proof. Let $\Theta_0 = \{\theta : u(\theta) \le u_0\}$, $\Theta_1 = \{\theta : u(\theta) \ge u_1\}$. By (El}, for $i = 0, 1$,

$$\inf_{\lambda \in \Theta_i} I(\theta, \lambda) = I_i(\theta), \quad \text{where} \quad I_i(\theta) = \inf_{\lambda : u(\lambda) = u_i} I(\theta, \lambda). \qquad (4.4)$$

Take any $\lambda \in \Theta_0$ and $\tilde\lambda \in \Theta_1$. In view of (4.4) and Hoeffding's (1960) lower bound for the
expected sample size $E_\theta(T)$ of a test that has error probabilities $\alpha$ and $\tilde\alpha$ at $\lambda$ and $\tilde\lambda$ and
takes at least $m$ and at most $M$ observations,

$$E_\theta(T) \ge m \vee \Big\{M \wedge \frac{(1 + o(1))\,|\log\alpha|}{I_0(\theta) \vee I_1(\theta)}\Big\} \qquad (4.5)$$

as $\alpha + \tilde\alpha \to 0$ such that $\log\alpha \sim \log\tilde\alpha$. We next show that the asymptotic lower bound in
(4.5) is attained by $N$. Since $m \sim a|\log\alpha|$ and $M \sim A|\log\alpha|$ and since the thresholds $b$, $\tilde b$
and $c$ are defined by (2.7)-(2.9), we can use an argument similar to the proof of Theorem 2(ii)
of Lai and Shih (2004, p. 525) to show that (4.2) holds. In fact, the second-stage sample
size of the adaptive test is a slight inflation of the Hoeffding-type lower bound (4.5) with $\theta$
replaced by the maximum likelihood estimate $\hat\theta_m$ at the end of the first stage. The assumption
$\rho_m \to 0$ but $\rho_m \gg m^{-1/2}(\log m)^{1/2}$ is used to accommodate the difference between $\theta$ and its
substitute $\hat\theta_m$, noting that as $m \to \infty$,

$$P_\theta\{\sqrt{m}\,|\hat\theta_m - \theta| \ge r(\log m)^{1/2}\} = o(m^{-1})$$

if $r$ is sufficiently large, by standard exponential bounds involving moment generating
functions. □
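
For a normal mean with unit variance, where $I(\theta, \lambda) = (\theta - \lambda)^2/2$, the leading term of the Hoeffding-type bound (4.5) can be evaluated directly. The sketch below is an illustration under that assumption, with simple hypotheses $\theta_0$ and $\theta_1$ standing in for $u_0$ and $u_1$; the function names are hypothetical and the default values of $\theta_1$, $\alpha$, $m$ and $M$ are borrowed from the setting of Section 3.4 purely as an example.

```python
import math


def kl_normal(theta, lam):
    """Kullback-Leibler information I(theta, lam) for N(., 1) observations."""
    return 0.5 * (theta - lam) ** 2


def hoeffding_bound(theta, theta0=0.0, theta1=0.29, alpha=0.025, m=25, M=125):
    """Leading term of (4.5): m v (M ^ |log alpha| / max(I_0, I_1)), ignoring o(1)."""
    info = max(kl_normal(theta, theta0), kl_normal(theta, theta1))
    if info == 0:
        return M  # only possible when theta0 == theta1 == theta
    return max(m, min(M, abs(math.log(alpha)) / info))
```

The bound is largest (capped at $M$) for $\theta$ roughly midway between $\theta_0$ and $\theta_1$, where neither hypothesis is easy to rule out, and falls to $m$ when $\theta$ is far from both. Replacing $\theta$ by the first-stage estimate and inflating slightly by $1 + \rho_m$ is the idea behind the adaptive second-stage sample size described in the proof.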
We can now extend the Hoeffding-type lower bound (4.5) to establish the asymptotic
optimality of the adaptive test in Section 2.2 that allows mid-course modification of the
maximum sample size. This adaptive test can be regarded as a mid-course amendment of
an adaptive test of $H_0 : u(\theta) \le u_0$ versus $H_1 : u(\theta) \ge u_1$, with a maximum sample size of
$M$, to that of $H_0$ versus $H_2 : u(\theta) \ge u_2$, with a maximum sample size of $\tilde M$. Whereas (4.5)
provides an asymptotic lower bound for tests of $H_0$ versus $H_1$, any test of $H_0$ versus $H_2$ with
error probabilities not exceeding $\alpha$ and $\tilde\alpha$ and taking at least $m$ and at most $\tilde M$ observations
likewise satisfies

$$E_\theta(T) \ge m \vee \Big\{\tilde M \wedge \frac{(1 + o(1))\,|\log\alpha|}{I_0(\theta) \vee I_2(\theta)}\Big\} \qquad (4.6)$$

as $\alpha + \tilde\alpha \to 0$ such that $\log\alpha \sim \log\tilde\alpha$. Note that $\Theta_1 = \{\theta : u(\theta) \ge u_1\} \subset \Theta_2 = \{\theta : u(\theta) \ge u_2\}$
and therefore $I_2(\theta) \le I_1(\theta)$. The 4-stage test in Section 2.2, with $M' = M$, attempts to
attain the asymptotic lower bound in (4.5) prior to the third stage and the asymptotic lower
bound in (4.6) afterwards. It replaces $I_1(\theta)$ in (4.5), which corresponds to early stopping for
futility, by $I_2(\theta)$ that corresponds to rejection of $H_2$ (instead of $H_1$) in favor of $H_0$. Thus,
the second-stage sample size $n_2$ corresponds to the lower bound in (4.5) with $\theta$ replaced by
$\hat\theta_m$ and $I_1$ replaced by $I_2$, while the third-stage sample size corresponds to that in (4.6) with
$\theta$ replaced by $\hat\theta_{n_2}$. The arguments used to prove the asymptotic optimality of the 3-stage
test in Theorem 4.1 can be readily modified to prove the following.

Theorem 4.2. Let $N^*$ denote the sample size of the four-stage GLR test in Section 2.2, with
$M' = M$. Assume that $\log\alpha \sim \log\tilde\alpha$ as $\alpha + \tilde\alpha \to 0$, that (4.1) holds and $\tilde M/|\log\alpha| \to \tilde A$
with $0 < a < A < \tilde A$. Then

$$E_\theta(N^*) = \begin{cases} m \vee (1 + o(1))\,|\log\alpha|/I_0(\theta) & \text{if } I_0(\theta) > A^{-1}, \\ m \vee [\tilde M \wedge (1 + o(1))\,|\log\alpha|/\{I_0(\theta) \vee I_2(\theta)\}] & \text{if } I_0(\theta) < A^{-1}. \end{cases}$$

The simulation results in Table 5 show that the two-stage tests of Proschan and Hunsberger
(1995) and Li et al. (2002), which use conditional power to determine the second-stage sample
size, have actual power much lower than the adaptive test of Section 2.1. Without
assuming a prespecified alternative $\theta_1$, the usual approach in the literature on sample size
re-estimation considers the case $d = 1$ and $u(\theta) = \theta$ and determines the second-stage sample
size via the conditional power criterion

$$n(\theta) = \min\{n > m : P_\theta(S_n \ge c_{\alpha,n} \mid S_m) \ge 1 - \tilde\alpha\}, \qquad (4.7)$$

choosing $n_2 = n(\hat\theta_m)$ if $\hat\theta_m > \theta_0 + \delta$ and stopping at the first stage due to futility otherwise,
where $P_{\theta_0}\{S_n \ge c_{\alpha,n}\} = \alpha$ and $\delta$ is chosen "to set an upper bound to limit the sample
size of the second stage"; see Li et al. (2002, p. 283). Although the conditional power given
$\hat\theta_m > \theta_0 + \delta$ is at least $1 - \tilde\alpha$ by choosing the second-stage sample size to be $n(\hat\theta_m)$, the actual
(unconditional) Type II error probability of the test at $\theta\ (> \theta_0)$ may substantially exceed $\tilde\alpha$ if
$m$ is not large enough since $P_\theta\{\hat\theta_m \le \theta_0 + \delta\}$ may well exceed $\tilde\alpha$. Stopping due to futility at
the end of the first stage when $\hat\theta_m \le \theta_0 + \delta$ can lead to serious loss of power of the two-stage
test. By allowing the test to have a possible third stage, we do not have to stop prematurely
when $\hat\theta_m$ falls below $\theta_0$, for which the conditional power criterion (4.7) is not applicable.
Thus, a three-stage test that uses conditional power to determine the second-stage sample
size chooses $n_1 = m$, $n_3 = M$ and $n_2 = \min\{n(\hat\theta_m), M\}$, where $n(\theta)$ is defined by (4.7) if
$\theta > \theta_0$, and by

$$n(\theta) = \max\{m, \lceil |\log\alpha| / I(\theta, \theta_1) \rceil\} \quad \text{if } \theta \le \theta_0. \qquad (4.8)$$

The rejection and futility boundaries are given by (2.2)-(2.4) as in the three-stage test of
Section 2.1. The asymptotic properties of the test, whose sample size is denoted by $N$, are
given by the following.

Theorem 4.3. Define $\eta_\theta$ for $\theta > \theta_0$ by

$$\theta_0 < \eta_\theta < \theta \quad \text{and} \quad I(\eta_\theta, \theta_0) = I(\eta_\theta, \theta). \qquad (4.9)$$

Then as $\alpha + \tilde\alpha \to 0$, $E_\theta(N) \sim n(\theta)$ if $\theta \le \theta_0$ and

$$E_\theta(N) \sim m \vee \{M \wedge |\log\alpha| / I(\eta_\theta, \theta_0)\} \quad \text{if } \theta > \theta_0. \qquad (4.10)$$

Proof. Suppose $\theta > \theta_0$. From (4.1) and the law of large numbers, it follows that $P_\theta\{\hat\theta_m >
\theta_0\} \to 1$ as $\alpha + \tilde\alpha \to 0$, and therefore

$$P_\theta\{\hat\theta_m > \theta_0 \ \text{and} \ N = n(\hat\theta_m) \wedge M\} \to 1. \qquad (4.11)$$

Since $n(\theta)$ is given by (4.7) in this case, application of Theorem 2(i) of Lai and Shih (2004)
then yields

$$n(\theta) \sim m \vee (|\log\alpha| / I(\eta_\theta, \theta)). \qquad (4.12)$$

In view of (4.1), we can apply Lebesgue's dominated convergence theorem to obtain (4.10)
from (4.11) and (4.12) in this case. Note in this connection that $I(\eta_\theta, \theta)$ is continuous in $\theta$
and that $\hat\theta_m$ converges to $\theta$ in probability.

We next consider the case $\theta \le \theta_0$. Then $n(\theta)$ is given by (4.8), which is less than
$M \sim |\log\alpha| / I(\theta_0, \theta_1)$ as $\alpha + \tilde\alpha \to 0$ such that $\log\alpha \sim \log\tilde\alpha$. By the law of large numbers,
$P_\theta\{\hat\theta_m \le \theta_0 \ \text{and} \ N = n(\hat\theta_m)\} \to 1$. Continuity of $I(\theta, \theta_1)$ and dominated convergence can
then be used to show that $E_\theta(N) \sim n(\theta)$. □

Since $I(\eta_\theta, \theta_0) < I(\theta, \theta_0)$ by (4.9), it follows from (4.2) and (4.10) that the three-stage
test using the conditional power criterion (4.7) is not asymptotically efficient. This is not
surprising since (4.7) is the sample size for the level-$\alpha$ FSS test to have at least $1 - \tilde\alpha$ power
at the alternative $\theta\ (> \theta_0)$. However, the optimal test with error probabilities $\alpha$ at $\theta_0$ and $\tilde\alpha$
at $\theta$ is Wald's sequential probability ratio test whose expected sample size is of the smaller
order $|\log\alpha| / I(\theta, \theta_0)$ under the assumptions of Theorem 4.1.
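
To make the conditional power criterion (4.7) concrete, the sketch below evaluates it for i.i.d. $N(\theta, 1)$ observations with $\theta_0 = 0$, taking $S_n$ to be the partial sum and $c_{\alpha,n} = z_\alpha \sqrt{n}$. These modelling choices, the function name and the search cap `n_max` are assumptions made for illustration and are not taken from the procedures compared in Section 3.3.

```python
import math
from statistics import NormalDist


def cp_sample_size(s_m, m, alpha=0.05, alpha_tilde=0.2, n_max=10_000):
    """Smallest n > m whose conditional power at theta_hat = s_m/m is >= 1 - alpha_tilde.

    Returns None if no n <= n_max qualifies (the criterion is unusable there).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    theta_hat = s_m / m
    for n in range(m + 1, n_max + 1):
        # Given S_m, the sum S_n is N(s_m + (n - m) * theta_hat, n - m) under theta_hat.
        shortfall = z_alpha * math.sqrt(n) - s_m - (n - m) * theta_hat
        cond_power = 1 - std_normal.cdf(shortfall / math.sqrt(n - m))
        if cond_power >= 1 - alpha_tilde:
            return n
    return None
```

The required $n$ grows quickly as the first-stage estimate approaches zero and no finite $n$ works once the estimate is negative, which is why two-stage tests built on (4.7) need a futility cutoff and why that cutoff can cost power when the estimate is low by chance.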

5 DISCUSSION 

A major drawback of the commonly used conditional power approach to two-stage designs
is pointed out in Section 3.3. The actual power can be much lower than the conditional
power since the estimated alternative at the end of the first stage can be quite different from
the actual alternative. In particular, if the estimated alternative falls in the region of the
null hypothesis and misleads one to stop for futility, there can be substantial loss of power.
On the other hand, early stopping for futility is critical for keeping the sample size of a
conditional power test within a manageable bound $M$. Our three-stage test makes use of
$M$ to come up with an implied alternative which is used to choose the rejection and futility
boundaries appropriately so that the test does not lose much power in comparison with the
(most powerful) fixed sample size test of the null hypothesis versus the implied alternative.
This idea underlying (2.7)-(2.9) that define the stopping boundaries of three-stage tests has
been used earlier by Lai and Shih (2004) to develop efficient group sequential tests.

Our approach estimates the second-stage sample size by using an approximation to
Hoeffding's lower bound for the expected sample size of sequential tests satisfying a prescribed
Type I error constraint and a Type II error constraint at the alternative that is estimated at
the end of the first stage. As shown in the simulation studies in Bartroff and Lai (2008) and
in the asymptotic theory in Section 4, this approach yields adaptive tests that are comparable
to the benchmark optimal adaptive test of Jennison and Turnbull (2006a,b) for a normal
mean, which assumes known variance and a specified alternative. Our approach can serve
to bridge the gap between the two "camps" in the adaptive design literature: One camp
focuses on efficient designs, under restrictive assumptions, that involve sufficient statistics and
optimal stopping rules, while the other camp emphasizes flexibility to address the difficulty
of coming up with realistic alternatives at the design stage. As pointed out in Section 1,
our approach that is built on the foundations of sequential testing theory is able to resolve
the dilemma between efficiency and flexibility. Like the "efficiency camp," it adheres to the
GLR test statistics whose efficiency is well established in the theory of FSS tests. An important
innovation is that it uses the Markov property to compute error probabilities when the
fixed sample size is replaced by a data-dependent sample size that is based on the estimated
alternative at the end of the first stage, like the "flexibility camp."